Introduction

library(tidyverse)
library(magrittr)
library(lubridate)
library(scales)
library(matrixStats)
library(ggrepel)
library(broom)
library(glue)
library(jsonlite)
library(pander)
library(plotly)
panderOptions("big.mark", ",")
panderOptions("table.split.table", Inf)
panderOptions("table.style", "rmarkdown")
theme_set(theme_bw())

Disclaimer: This very simple report was prepared by a bioinformatician with no experience in epidemiology or virology, and as such should be treated simply as an alternate viewpoint on the data, which I was simply unable to find elsewhere. Many other people exist with much greater expertise on this subject. However, I do hope this provides a useful perspective which is able to add constructively to the wider discussion. In addition, it should be noted that this is very much focussed on Australian data.

Data was sourced from Johns Hopkins University (https://coronavirus.jhu.edu/), using the datasets provided by JHU at https://github.com/CSSEGISandData/COVID-19. JHU data is updated every 24 hours at approximately 23:59(UTC), which is about 10:30AM in Adelaide.

The official government figures at www.health.gov.au appear to lag media reports, likely to ensure accuracy of the official numbers. The official Australian numbers are not used for this analysis, instead relying on those provided by JHU, given that JHU are extremely careful and comprehensive in their sourcing of numbers. A significant disparity is not expected. Live updates for Australia are available from https://covid-19-au.github.io/ for those who would like an up to the minute breakdown of confirmed cases. Additionally, Australian values are updated periodically throughout the day using values provided by https://api.infotorch.org/api/covid19/totals/.

confirmed <- url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Confirmed.csv") %>%
    read_csv() %>%
    pivot_longer(
        cols = ends_with("20"),
        names_to = "date",
        values_to = "confirmed"
    )  %>%
    mutate(
        date = str_replace_all(
            date, "(.+)/(.+)/(.+)", "20\\3-\\1-\\2"
        ) %>%
            ymd()
    ) %>%
    dplyr::rename(
        Country = `Country/Region`
    ) %>%
    dplyr::mutate(
        Country = case_when(
            `Province/State` == "Hubei" ~ "China (Hubei)",
            `Province/State` == "Hong Kong" ~ "Hong Kong",
            grepl("China", Country) & !`Province/State` %in% c("Hubei", "Hong Kong") ~ "China (Other)",
            !grepl("China", Country) ~ Country
        )
    ) %>%
    dplyr::filter(
        !is.na(confirmed)
    ) 
latestAU <- fromJSON("https://api.infotorch.org/api/covid19/totals/") %>%
  as_tibble() %>%
  rename(
    `Province/State` = state_long,
    date = date_updated
  ) %>%
  mutate(
    date = ymd(date),
    Country = "Australia",
    `Province/State` = str_replace_all(
      `Province/State`, "ACT", "Australian Capital Territory"
    )
  ) %>%
  dplyr::filter(
    !str_detect(`Province/State`, "Total")
  ) %>%
  dplyr::select(
    any_of(colnames(confirmed)), deaths, tested
  )
# Update the AU records if there are any timepoints
# on days beyond the last value provided by JHU
updateAU <- max(latestAU$date) >
  max(
    dplyr::filter(confirmed, Country == "Australia")$date
    )
if (updateAU){
  confirmed %<>%
    bind_rows(
      dplyr::select(latestAU, any_of(colnames(.)))
    )
}

Population sizes for the most impacted countries were manually obtained from https://www.worldometers.info/ and should not be considered as authoritative. Given the disparity of infection within China, China was broken into Hubei Province, Hong Kong and the rest of China. As an open acknowledgement of the crudeness of population values, population estimates for Hubei Province were taken from the 2018 estimates given by Statista.com. This particular population estimate is likely to be low, and as a result confirmed case rates for Hubei province may be an overestimate, and consequently, confirmed case rates for the rest of China may also be an underestimate.

Confirmed cases of COVID-19 as provided by the Chinese Government have been discussed elsewhere as unusual, and data appears potentially unreliable. In this analysis, discussions regarding accurate Chinese reporting are not considered further and data is simply presented as supplied by JHU.

However, all countries are likely to contain many unreported cases given the incomplete testing regimes in place for most countries. Similarly, reporting in many countries may have features that cause concerns regarding data integrity and this makes comparison across countries difficult.

hb <- 59170000
pops <- tribble(
    ~Country, ~Population, ~Region,
    "Australia", 25499884, "Oceania",
    "Austria", 9006398, "Europe",
    "Belgium", 11575214, "Europe",
    "Brazil", 212129490, "South America",
    "Canada", 37742154, "North America",
    "Chile", 19068026, "South America",
    "China (Hubei)", hb, "Asia",
    "China (Other)", 1408526449 - hb, "Asia",
    "Czechia", 10703010, "Europe",
    "Denmark", 5786274, "Europe",
    "Ecuador", 17564089, "South America",
    "Finland", 5538181, "Europe",
    "France", 65273511, "Europe",
    "Germany", 83783942, "Europe",
    "Greece", 10423054, "Europe",
    "Hong Kong", 7479307, "Asia",
    "Iceland", 364260, "Europe",
    "Indonesia", 272702678, "Asia",
    "Italy", 60486925, "Europe",
    "Iran", 83677594, "Middle East",
    "Ireland", 4921810, "Europe",
    "Israel", 8615601, "Middle East",
    "Japan", 126476461, "Asia",
    "Jordan", 10174146, "Middle East",
    "Korea, South", 51269185, "Asia",
    "Luxembourg", 613894, "Europe",
    "Malaysia", 32245488, "Asia",
    "Netherlands", 17134872, "Europe",
    "New Zealand", 4811065, "Oceania",
    "Norway", 5408930, "Europe",
    "Pakistan",  219629013, "Asia",
    "Poland", 37858096, "Europe",
    "Portugal", 10196709, "Europe",
    "Qatar", 2866531, "Middle East",
    "Russia", 145917069, "Europe",
    "Singapore", 5836728, "Asia",
    "Spain", 46754778, "Europe",
    "Sweden", 10081035, "Europe",
    "Switzerland", 8654622, "Europe",
    "Taiwan*", 23804524, "Asia",
    "Turkey", 84081383, "Europe",
    "United Arab Emirates", 9856053, "Middle East", 
    "United Kingdom", 67886011, "Europe",
    "US", 331002651, "North America"
)

International Data

Table of Most Impacted Countries

All information presented in this section is effectively the cumulative, confirmed incidence rate. Recovered patients and those who have passed away are still included in these numbers. Testing rates and results are also not included, and these may be highly informative if able to be sourced accurately.

minPop <- 4e6
confirmed %>%
  group_by(Country, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  ungroup() %>%
  group_by(Country) %>%
  dplyr::filter(
    date == max(date),
  ) %>%
  ungroup() %>%
  right_join(pops) %>%
  dplyr::filter(
    Population > minPop
  ) %>%
  mutate(
    rate = 1e6*confirmed / Population
  ) %>%
  arrange(desc(rate)) %>%
  dplyr::slice(1:25) %>%
  mutate(
    rate = sprintf("%.1f", rate),
    Population = comma(Population)
  ) %>%
  rename_at(vars(everything()), str_to_title) %>%
  dplyr::select(
    Country, Date, Confirmed, Population, Rate
  ) %>%
  rename(`Rate (Cases per Million)` = Rate) %>%
  pander(
    justify = "lrrrr",
    caption = paste(
      "The", nrow(.), "most impacted countries studied here and shown as a proportion of total population.",
      "Only countries with a population greater than",
      comma(minPop), "are shown.",
      "The final column provides the latest confirmed infection rate as cases per million people.",
      "Whilst the virus spreads with no regard to population size, the rate as shown here indicates the __degree of stress which each country's health-care system is likely to be experiencing__.",
      "Several countries shown here have not attracted much media attention due lower case numbers than China and Italy, but are likely to be experiencing significant duress."
    )
  )
The 25 most impacted countries studied here and shown as a proportion of total population. Only countries with a population greater than 4,000,000 are shown. The final column provides the latest confirmed infection rate as cases per million people. Whilst the virus spreads with no regard to population size, the rate as shown here indicates the degree of stress which each country’s health-care system is likely to be experiencing. Several countries shown here have not attracted much media attention due lower case numbers than China and Italy, but are likely to be experiencing significant duress.
Country Date Confirmed Population Rate (Cases per Million)
China (Hubei) 2020-03-20 67,800 59,170,000 1145.9
Italy 2020-03-20 47,021 60,486,925 777.4
Switzerland 2020-03-20 5,294 8,654,622 611.7
Spain 2020-03-20 20,410 46,754,778 436.5
Norway 2020-03-20 1,914 5,408,930 353.9
Austria 2020-03-20 2,388 9,006,398 265.1
Germany 2020-03-20 19,848 83,783,942 236.9
Iran 2020-03-20 19,644 83,677,594 234.8
Denmark 2020-03-20 1,337 5,786,274 231.1
Belgium 2020-03-20 2,257 11,575,214 195.0
France 2020-03-20 12,726 65,273,511 195.0
Netherlands 2020-03-20 3,003 17,134,872 175.3
Korea, South 2020-03-20 8,652 51,269,185 168.8
Sweden 2020-03-20 1,639 10,081,035 162.6
Ireland 2020-03-20 683 4,921,810 138.8
Portugal 2020-03-20 1,020 10,196,709 100.0
Israel 2020-03-20 705 8,615,601 81.8
Finland 2020-03-20 450 5,538,181 81.3
Czechia 2020-03-20 833 10,703,010 77.8
Singapore 2020-03-20 385 5,836,728 66.0
United Kingdom 2020-03-20 4,014 67,886,011 59.1
US 2020-03-20 19,100 331,002,651 57.7
Greece 2020-03-20 495 10,423,054 47.5
Australia 2020-03-21 1,067 25,499,884 41.8
Hong Kong 2020-03-20 256 7,479,307 34.2

Cumulative Incidence Rates

minToInclude <- 5
startingPoint <- 1
nDays <- confirmed %>%
  dplyr::filter(Country == "Hong Kong") %>% 
  group_by(Country, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  ungroup() %>%
  left_join(pops) %>%
  mutate(
    rate = 1e6*confirmed / Population
  ) %>% 
  dplyr::filter(rate > startingPoint) %>% 
  nrow() %>%
  subtract(1)
p <- confirmed %>%
  group_by(Country, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  ungroup() %>%
  right_join(pops) %>%
  mutate(
    rate = 1e6*confirmed / Population
  ) %>%
  group_by(Country) %>%
  dplyr::filter(
    max(rate) > minToInclude
  ) %>%
  ungroup() %>%
  split(f = .$Country) %>%
  lapply(function(x, r = startingPoint){
    std <- min(dplyr::filter(x, rate > r)$date)
    x %>%
      dplyr::filter(date >= std) %>%
      mutate(days = date - std)
  }) %>%
  bind_rows() %>%
  mutate(
    days = as.integer(days),
    rate = round(rate, 2)
  ) %>%
  dplyr::filter(days <= nDays) %>%
  arrange(date) %>%
  mutate(Country = fct_inorder(Country)) %>%
  rename_all(str_to_title) %>%
  ggplot(
    aes(Days, Rate, colour = Country, Date = Date, Confirmed = Confirmed)
  ) +
  geom_segment(
    aes(x, y, xend = xmax, yend = ymax),
    data = tibble(
      x = 0,
      y = 1,
      xmax = c(20.5, 41, nDays - 2),
      ymax = 2^(xmax/ c(2, 4, 8))
    ),
    inherit.aes = FALSE,
    colour = "grey70",
    linetype = 2
  ) + 
  geom_text(
        aes(xmax, ymax, label = label),
    data = tibble(
      xmax = c(20.5, 41, nDays - 2),
      rate = c(2, 4, 8),
      ymax = 2^(xmax/ rate),
      label = glue("Doubling in\n {rate} days")
    ),
    colour = "grey70",
    inherit.aes = FALSE
  ) +
  geom_point() +
  geom_line() +
  scale_x_continuous(
    expand = expand_scale(mult = c(0, 0.05)),
  ) +
  scale_y_log10(
    expand = expand_scale(mult = c(0, 0.05)),
  ) +
  xlab(
    paste(
      "Days since passing", 
      startingPoint,
      "confirmed case/million"
    )
  ) +
  ylab("Confirmed Infection Rate (cases/million)")
ggplotly(
  p, 
  tooltip = c(
    "Days", "Rate", "Country", "Date", "Confirmed"
  )) 

Confirmed cases of COVID-19 for countries where the infection rate has exceeded 5 cases/million. Data is only shown for the first 54 calendar days since passing 1 confirmed case/million. Note that from the day records begin in this dataset, the confirmed infection rate in Hubei was 7.59 confirmed cases/million. Diagonal grey lines indicate a doubling in the infection rate every 2, 4, or 8 days. To hide a country, click on the country in the plot legend. Clicking again on the country in the legend will restore the data within the plot. Countries are shown in order of the date at which they passed the 1 confirmed case/million mark. Due to the large number of countries shown, you may need to scroll through the legend. Regions of the plot are also able to be zoomed interactively. Please note the y-axis is shown on the logarithmic scale, so that a series of points which appear to be diagonal will indicate exponential growth. The flatter the line, the slower the growth and a perfectly horizontal line would indicate zero growth, or no new confirmed cases.

All figures and tables presented here simply aim to show an alternative viewpoint on the data. Every way to view COVID-19 data will mask important features, and the values shown here do not take into account vital factors such as population density, variability of infection across regions within countries, social culture and demographics. Many countries may not be directly comparable for a combination of the above factors. It is simply to view the data through the lens of a country’s population size using a value which should be easily interpretable.

In the above plot:

  • Only countries are shown where the infection rate has surpassed 5 cases/million. This was a completely arbitrary choice, but seemed likely to indicate a level of COVID-19 infection which posed noticeable community risk
  • The choice of “Days since passing 1 case/million” was also arbitrary, but seemed to be a useful way of enabling the comparison of COVID-19 progression across countries

Additionally, Australia’s spread of the virus appears marginally slower than many other countries.

  • This may be a reflection of Australian geography or our tendency to have large personal space requirements
  • This may also be a reflection of the pre-emptive strategies already being taken by the population and government
  • Overall, the Australian infection rate brings mild relief that we are doing a little better than many other countries. However testing rates are far lower than in countries like South Korea and many genuine cases are likely to be undetected
  • Despite the above, Australia is still clearly in the exponential growth phase with no sign of the growth levelling out. Infections are currently doubling in slightly less than 4 days, meaning Australian infections are expected to exceed 2,134 by Wednesday 25 March, 2020. If this case rate is not observed, it may be an early sign that distancing measures are beginning to be effective
  • Using an estimated population size of 25,499,884, the total percentage of the Australian population confirmed as infected currently sits at 0.004%.

Turning to other countries beyond Australia:

  • Only South Korea,and China appear to have brought the virus under some level of control. The infection rate currently sits around 168.8 per million for South Korea and 1145.9 per million for Hubei Province. This can also be thought of as 0.0169% and 0.1146% of the total populations respectively. Note that despite these apparently low percentages, the actual numbers of deaths and hospitalisations are extremely non-trivial and the real impacts of these numbers cannot be treated lightly.
  • Japan’s early measures also appear to be highly effective, however the recent growth in infections within Singapore should serve as a caution
  • Some of the above countries (e.g. Norway, Denmark) appear to be in the early stages of levelling out, indicating that the initial exponential growth phase may be ending in these countries.
  • Other countries such as Italy and the UK are clearly still in an exponential growth phase.
  • US numbers may also be unreliable given the low rates of testing and numerous anecdotal stories of suspected infections

Currently Active Infections

As an alternative viewpoint, the numbers of recovered and deceased patients have been removed from the above plot to provide an estimate of the currently active infections.

recovered <- url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Recovered.csv") %>%
  read_csv() %>%
  pivot_longer(
    cols = ends_with("20"),
    names_to = "date",
    values_to = "recovered"
  )  %>%
  mutate(
    date = str_replace_all(
      date, "(.+)/(.+)/(.+)", "20\\3-\\1-\\2"
    ) %>%
      ymd()
  ) %>%
  dplyr::rename(
    Country = `Country/Region`
  ) %>%
  dplyr::mutate(
    Country = case_when(
      `Province/State` == "Hubei" ~ "China (Hubei)",
      `Province/State` == "Hong Kong" ~ "Hong Kong",
      grepl("China", Country) & !`Province/State` %in% c("Hubei", "Hong Kong") ~ "China (Other)",
      !grepl("China", Country) ~ Country
    )
  )
deaths <- url("https://raw.githubusercontent.com/CSSEGISandData/COVID-19/master/csse_covid_19_data/csse_covid_19_time_series/time_series_19-covid-Deaths.csv") %>%
  read_csv() %>%
  pivot_longer(
    cols = ends_with("20"),
    names_to = "date",
    values_to = "deaths"
  )  %>%
  mutate(
    date = str_replace_all(
      date, "(.+)/(.+)/(.+)", "20\\3-\\1-\\2"
    ) %>%
      ymd()
  ) %>%
  dplyr::rename(
    Country = `Country/Region`
  ) %>%
  dplyr::mutate(
    Country = case_when(
      `Province/State` == "Hubei" ~ "China (Hubei)",
      `Province/State` == "Hong Kong" ~ "Hong Kong",
      grepl("China", Country) & !`Province/State` %in% c("Hubei", "Hong Kong") ~ "China (Other)",
      !grepl("China", Country) ~ Country
    )
  )
# No recoveries are reported in the latestAU data
# Assume values are identical
if (updateAU){
  deaths %<>%
    bind_rows(
      dplyr::select(latestAU, any_of(colnames(.)))
    )
  recovered %<>%
    bind_rows(
      dplyr::select(latestAU, any_of(colnames(.)))
      ) %>% 
    arrange(`Province/State`, Country, date) %>%
    mutate(recovered = zoo::na.locf(recovered))
}
p2 <- confirmed %>%
  dplyr::filter(
    !str_detect(Country, "Other")
  ) %>%
  left_join(recovered) %>%
  left_join(deaths) %>%
  mutate(
    active = confirmed - recovered - deaths
  ) %>%
  dplyr::filter(
    !is.na(active)
  ) %>%
  group_by(
    Country, date
  ) %>%
  summarise(
    active = sum(active),
    confirmed = sum(confirmed),
    recovered = sum(recovered),
    deaths = sum(deaths)
  ) %>%
  right_join(pops) %>%
  mutate(rate = 1e6*active / Population)  %>%
  dplyr::filter(
    max(rate) > minToInclude
  ) %>%
  ungroup() %>%
  split(f = .$Country) %>%
  lapply(function(x, r = startingPoint){
    std <- min(dplyr::filter(x, rate > r)$date)
    x %>%
      dplyr::filter(date >= std) %>%
      mutate(days = date - std)
  }) %>%
  bind_rows() %>%
  mutate(
    days = as.integer(days),
    rate = round(rate, 2)
  ) %>%
  dplyr::filter(days <= nDays) %>%
  arrange(date) %>%
  mutate(Country = fct_inorder(Country)) %>%
  rename_all(str_to_title) %>%
  ggplot(
    aes(
      x = Days, y = Rate, colour = Country, 
      Date = Date, Active = Active,
      Confirmed = Confirmed, Recovered = Recovered,
      Deaths = Deaths
      )
  ) +
  geom_segment(
    aes(x, y, xend = xmax, yend = ymax),
    data = tibble(
      x = 0,
      y = 1,
      xmax = c(20.5, 41, nDays),
      ymax = 2^(xmax/ c(2, 4, 8))
    ),
    inherit.aes = FALSE,
    colour = "grey70",
    linetype = 2
  ) + 
  geom_text(
        aes(xmax, ymax, label = label),
    data = tibble(
      xmax = c(20.5, 41, nDays),
      rate = c(2, 4, 8),
      ymax = 2^(xmax/ rate),
      label = glue("Double in\n {rate} days")
    ),
    colour = "grey70",
    inherit.aes = FALSE
  ) +
  geom_point() +
  geom_line() +
  scale_y_log10() +
  xlab(
    paste(
      "Days since passing", 
      startingPoint, 
      "confirmed active case/million"
    )
  ) +
  ylab("Confirmed Active Infections (cases/million)")
ggplotly(p2) 

Confirmed active cases of COVID-19 for countries where the confirmed infection rate has exceeded 5 cases/million. Data is only shown for the first 54 calendar days since passing 1 confirmed case/million. Due to difficulties introduced by the currently reported low active infection rate outside Hubei province, data from China has been excluded from this plot, with the exception of Hubei. To hide a country, click on the country in the plot legend. Clicking again on the country in the legend will restore the data within the plot. Countries are shown in order of the date at which they passed the 1 confirmed active case/million mark. Due to the number of countries shown, you may need to scroll through the legend. Regions of the plot are also able to be zoomed interactively. Please note the y-axis is shown on the logarithmic scale, so that a series of points which appear to be diagonal will indicate exponential growth/decay. Given the different starting point to the previous plot, data will generally be shown for fewer time-points.

Notable features of this perspective are:

  • No deaths from confirmed cases have been reported in Singapore. Despite recent increases in active infections, has “flattening the curve” been successful?
  • Active infection rates in South Korea are currently declining
  • According to these figures, active infections within Hubei province are also significantly declining

Dimensional Reduction

In order to summarise which countries are the most similar to each other, a Principal Component Analysis was performed. This enables the multi-dimensional data of the above plots to summarised in two dimensions. As missing data cannot be included in this analysis, several countries which are at earlier comparative time-points than Australia were omitted from this analysis.

nDays <- confirmed %>% 
  dplyr::filter(Country == "Australia") %>% 
  group_by(Country, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  ungroup() %>%
  left_join(pops) %>%
  mutate(
    rate = 1e6*confirmed / Population
  ) %>% 
  dplyr::filter(rate > startingPoint) %>% 
  nrow() %>%
  subtract(1)
pca <- confirmed %>%
  group_by(Country, date) %>%
  summarise(confirmed = sum(confirmed)) %>%
  ungroup() %>%
  right_join(pops) %>%
  mutate(
    rate = 1e6*confirmed / Population
  ) %>%
  group_by(Country) %>%
  dplyr::filter(
    max(rate) > minToInclude
  ) %>%
  ungroup() %>%
  split(f = .$Country) %>%
  lapply(function(x, r = startingPoint){
    std <- min(dplyr::filter(x, rate > r)$date)
    x %>%
      dplyr::filter(date >= std) %>%
      mutate(days = date - std)
  }) %>%
  bind_rows() %>%
  mutate(
    days = as.integer(days),
    rate = round(rate, 2)
  ) %>%
  dplyr::filter(days <= nDays) %>%
  dplyr::select(Country, rate, days) %>%
  pivot_wider(
    values_from = rate,
    names_from = days
  ) %>%
  as.data.frame() %>%
  column_to_rownames("Country") %>%
  as.matrix() %>%
  .[!rowAnyNAs(.),] %>%
  log() %>%
  prcomp() 
set.seed(101)
pca$x %>%
  as.data.frame() %>%
  rownames_to_column("Country") %>%
  mutate(
    Cluster = cbind(PC1, PC2) %>%
      kmeans(k) %>%
      .[["cluster"]] %>%
      as.factor()
  ) %>%
  ggplot(aes(PC1, PC2, label = Country, colour = Cluster)) +
  geom_point() +
  geom_text_repel(
    show.legend = FALSE
  ) +
  stat_ellipse(
    aes(fill = Cluster),
    colour = NA,
    geom = "polygon",
    alpha = 0.1
    ) +
  xlab(
    paste0(
      "PC1 (", 
      percent(summary(pca)$importance["Proportion of Variance","PC1"], accuracy = 0.1), 
      ")"
      )
  ) +
    ylab(
    paste0(
      "PC2 (", 
      percent(summary(pca)$importance["Proportion of Variance","PC2"], accuracy = 0.1), 
      ")"
      )
  ) +
  theme(
    legend.position = "none"
  )
*Dimensional reduction showing which countries are most similar to each other at the 20 day mark since passing 1 case/million. The value 20 days was chosen as this marks how long Australia has been passed this point. Countries which have not progressed beyond 1 case/million for 20 days or more are not shown as dimensional reduction via PCA requires no missing values. Clusters were generated using k-means, manually specifying k =  2, and are not definitive. Principal Component 1, on the x-axis, corresponds to the greatest source of variability within the data (93.8%), and countries which appear near each other along this axis can be assumed to be showing similar growth in confirmed infection rates at this time point. Separation along the y-axis is less significant, but also worthy of note, as this represents 5.0% of variability within the data. At this early point, Australia's cumulative, confirmed infection rate is diverging from the cluster of countries which have responded well and is becoming more similar to Israel and other poor responding countries. This is suggestive that the early measures instituted in Australia may have been inadequate.*

Dimensional reduction showing which countries are most similar to each other at the 20 day mark since passing 1 case/million. The value 20 days was chosen as this marks how long Australia has been passed this point. Countries which have not progressed beyond 1 case/million for 20 days or more are not shown as dimensional reduction via PCA requires no missing values. Clusters were generated using k-means, manually specifying k = 2, and are not definitive. Principal Component 1, on the x-axis, corresponds to the greatest source of variability within the data (93.8%), and countries which appear near each other along this axis can be assumed to be showing similar growth in confirmed infection rates at this time point. Separation along the y-axis is less significant, but also worthy of note, as this represents 5.0% of variability within the data. At this early point, Australia’s cumulative, confirmed infection rate is diverging from the cluster of countries which have responded well and is becoming more similar to Israel and other poor responding countries. This is suggestive that the early measures instituted in Australia may have been inadequate.

Fatality, Recovery and Active Infection Rates

Summarising all available data from all countries:

  • The current fatality rate is 4.1%. This may be a function of under-reporting of true cases and is likely to be an overestimate.
  • The current recovery rate is 32.0%. Given the level of under-reporting this may also be highly inaccurate.
  • At this point, 63.8% of all confirmed infections are considered as ‘active’.
p4 <- confirmed %>%
  group_by(Country) %>%
  dplyr::filter(date == max(date)) %>%
  summarise(
    confirmed = sum(confirmed)
  ) %>%
  dplyr::filter(
    confirmed > 100,
    Country != "Cruise Ship"
  ) %>%   
  left_join(
    recovered %>%
      group_by(Country) %>%
      dplyr::filter(date == max(date)) %>%
      summarise(
        recovered = sum(recovered)
      )
  ) %>%
  left_join(
    deaths %>%
      group_by(Country) %>%
      dplyr::filter(date == max(date)) %>%
      summarise(
        deaths = sum(deaths)
      )
  ) %>%
  mutate(
    active = confirmed - recovered - deaths
  ) %>%
  mutate(
    active = 100*active / confirmed,
    recovered = 100*recovered / confirmed,
    fatalities = 100*deaths / confirmed
  ) %>%
  dplyr::filter(active < 100) %>%
  arrange(desc(confirmed)) %>%
  mutate(Country = fct_inorder(Country)) %>%
  pivot_longer(
    cols = c(active, recovered, fatalities),
    names_to = "Status",
    values_to = "Percentage"
  ) %>%
  mutate(
    Status = str_to_title(Status),
    Status = factor(
      Status, 
      levels = c("Active", "Recovered", "Fatalities")
    ),
    Percentage = round(Percentage, 2)
  ) %>%
  mutate(confirmed = comma(confirmed)) %>%
  rename(Confirmed = confirmed) %>%
  ggplot(
    aes(
      Country, Percentage,
      fill = Status, cases = Confirmed
    )
  ) +
  geom_col() +
  scale_fill_manual(
    values = c(
      Active = "blue",
      Recovered = "green",
      Fatalities = "red"
      )
  ) +
  scale_y_continuous(expand = expand_scale(0, 0)) +
  coord_flip() +
  theme(
    legend.position = "none"
  )
ggplotly(p4)

Fatality, Recovery and Active Infection rates for countries which have exceeded 100 confirmed cases. Countries are shown in order of the number of confirmed cases.

Fatality Rate Vs Time

fr <- confirmed %>%
  inner_join(deaths) %>%
  group_by(Country) %>%
  dplyr::filter(date == max(date)) %>%
  ungroup() %>%
  summarise(fr = sum(deaths) / sum(confirmed)) %>%
  .[["fr"]]
minDays <- 10
plotFr <- confirmed %>%
  inner_join(deaths) %>%
  group_by(Country, date) %>%
  summarise_at(
    vars(confirmed, deaths),
    sum
  ) %>%
  dplyr::filter(
    confirmed >= 100,
    max(deaths) > 0,
    Country != "Cruise Ship"
  ) %>%
  mutate(
    Days = date - min(date),
    n = n()
  ) %>%
  ungroup() %>%
  dplyr::filter(n > minDays) %>%
  arrange(desc(n)) %>%
  mutate(
    Country = fct_inorder(Country),
    Rate = deaths / confirmed,
    Days = as.integer(Days),
    `Fatality Rate` = percent(Rate, accuracy = 0.1)
  ) %>%
  rename_all(str_to_title) %>%
  ggplot(
    aes(
      x = Days, y = Rate, 
      colour = Country,
      conf = Confirmed,
      deaths = Deaths,
      label = `Fatality Rate`,
      dt = Date
      )
    ) +
  geom_line() +
  geom_hline(
    yintercept = fr,
    linetype = 2,
    colour = "grey50"
  ) +
  labs(
    x = "Days Since 100th Case",
    y = "Fatality Rate"
  ) +
  scale_y_continuous(labels = percent)
cpFr <- glue(
  "*Fatality Rate for confirmed cases shown as a function of time since the 100^th^ confirmed case.
  Only countries with {minDays} days of data beyond this time-point are shown.
  Countries with recorded fatalities have also been excluded. 
  The overall Fatality Rate for confirmed cases ({percent(fr, accuracy = 0.1)}) is shown as the horizontal grey line.*"
)
ggplotly(
  plotFr,
  tooltip = c(
    "Country", "Date", "Days", "Confirmed", "Deaths", 
    "Fatality Rate"
    )
  )

Fatality Rate for confirmed cases shown as a function of time since the 100th confirmed case. Only countries with 10 days of data beyond this time-point are shown. Countries with recorded fatalities have also been excluded. The overall Fatality Rate for confirmed cases (4.1%) is shown as the horizontal grey line.

Australian Specific Data

Australian State populations were taken from the ABS Website and were accurate in Sept 2019. The difference with previous estimates used above was within 0.04%, and as such no adjustments were made.

A series of complimentary charts regarding the spread of COVID-19 are available from the ABC website.

State-wise Comparison of Confirmed Infection Rates

ausPops <- tribble(
  ~State, ~Population,
  "New South Wales",    8117976,
  "Victoria", 6629870,
   "Queensland", 5115451,
  "South Australia", 1756494,
  "Western Australia", 2630557,
  "Tasmania", 535500,
  "Northern Territory", 245562,
  "Australian Capital Territory", 428060
)
minRate <- 4
p5 <- confirmed %>%
  dplyr::filter(
    Country == "Australia"
  ) %>%
  dplyr::rename(State = `Province/State`) %>%
  left_join(ausPops) %>%
  mutate(
    Rate = round(1e6*confirmed / Population, 2),
    Date = format.Date(date, "%d-%B")
  ) %>%
  dplyr::filter(
    !is.na(Population),
    !str_detect(State, "Territory"),
    Rate > minRate
  ) %>%
  arrange(date) %>%
  mutate(
    State = fct_inorder(State)
  ) %>%
  dplyr::rename(
    Confirmed = confirmed
  ) %>%
  ggplot(
    aes(
      x = date, y = Rate, colour = State, 
      label = Date, key = Confirmed
      )
  ) +
  geom_point() +
  geom_smooth(
    method = "lm", 
    se = FALSE,
    show.legend = FALSE
  ) +
  geom_line(linetype = 3) +
  scale_y_log10() +
  labs(
    x = "Date",
    y = "Confirmed Infection Rate (cases/million)"
  )
# Hide the tooltip from the regression lines
n <- length(levels(p5$data$State))
p5 <- ggplotly(p5, tooltip = c("Date", "Rate", "State", "Confirmed"))
regIndex <- seq(n + 1, length.out = n, by = 1)
p5$x$data[regIndex] <- lapply(
  p5$x$data[regIndex],
  function(x){
  x$hoverinfo <- "none"
  x
})
p5

Infection rates for each state with data beginning for each state once 4 confirmed cases /million was exceeded. Linear regression lines are shown for each state as solid lines, with NSW perhaps showing a slightly increased rate of infection within the population. Once again, the y-axis is on a logarithmic scale indicating exponential growth is occurring. States are shown in order of the initial date they passed 4 cases/million. Due to the low population sizes in the ACT and NT, these territories were omitted.

Statistical Analysis of State-wise Infection Rates

In order to test whether the infection rates are different between states, a linear regression model was fit. \(\log_{10}\)(Cumulative Confirmed Infection Rate) was assigned as the response variable with predictor variables being the State and Date. Each State was assigned its own intercept and slope by use of an interaction term (i.e. State:date). Given the potentially larger slope in NSW, this state was set as the baseline, with each other slope (i.e. interaction term) being presented as the difference in slope between each state and NSW. In this way comparisons against NSW were performed, but no comparisons between other states were performed.

Differences in the State-level intercepts are not particularly relevant, apart from capturing the initial time at which cumulative confirmed infection rate exceeded 4 cases / million. Differences between State-level slopes and NSW however, are of great interest, and as such only the slopes are shown. For NSW (Term = date) this captures the actual slope of the daily change in infection rate, whilst for all other States, this represents the difference between the daily change in infection rate for that State in comparison to NSW. Only differences in slope with an FDR-adjusted p-value < 0.05 are of particular interest. For those States which appear to be of interest, a negative value indicates an infection rate increasing more slowly than NSW, whilst a positive value indicates the opposite..

lm <- confirmed %>%
  dplyr::filter(
    Country == "Australia"
  ) %>%
  dplyr::rename(State = `Province/State`) %>%
  left_join(ausPops) %>%
  mutate(
    Rate = round(1e6*confirmed / Population, 2),
    Date = format.Date(date, "%d-%B")
  ) %>%
  dplyr::filter(
    !is.na(Population),
    !str_detect(State, "Territory"),
    Rate > minRate
  ) %>%
  arrange(desc(confirmed)) %>%
  mutate(
    State = fct_inorder(State)
  ) %>%
  dplyr::rename(
    Confirmed = confirmed
  ) %>%
  with(
    lm(log10(Rate) ~ (State + date)^2)
  ) 
lm %>%
  tidy() %>%
  mutate(
    FDR = p.adjust(p.value, method = "fdr"),
    term = str_remove_all(term, "State")
  ) %>%
  rename(
    Term = term,
    Estimate = estimate,
    SE = std.error,
    `T` = statistic,
    p = p.value
  ) %>%
  dplyr::filter(
    str_detect(Term, ":date")
  ) %>%
  mutate(
    Term = str_remove(Term, ":date"),
    Estimate = sprintf("%.4f", Estimate),
    SE = sprintf("%.3f", SE),
    `T` = sprintf("%.2f", `T`),
    p = case_when(
      p < 1e-4 ~ sprintf("%.2e", p),
      p >= 1e-4 ~ sprintf("%.4f", p)
    ),
    FDR = case_when(
      FDR < 1e-4 ~ sprintf("%.2e", FDR),
      FDR >= 1e-4 ~ sprintf("%.4f", FDR)
    )
  ) %>%
  pander(
    justify = "lrrrrr",
    emphasize.strong.rows = which(
      as.numeric(.$FDR) < 0.05 & .$Term != "date"
      ),
    caption = paste(
      "*Results of linear regression analysis", 
      "comparing the slopes of lines which track __daily",
      "change in cumulative confirmed infection rates__,",
      "scaled for population size in each state.",
      "Intercept terms are not shown.",
      "All states are shown in comparison to NSW.", 
      "Any highlighted state indicates an infection rate which is increasing at a different daily rate to NSW.",
      "Positive values in the Estimate column indicate a greater rate the NSW, whilst a negative value in the Estimate column indicates a slower growth rate than NSW.",
      "Diagnostic plots visually appeared",  
      "to be within acceptable bounds for real data,",
      "the fitted model passed the Shapiro-Wilk Test",
      "for normality", 
      paste0(
        "(p = ", 
        sprintf(
          "%.4f", shapiro.test(resid(lm))$p.value
        ),
        ")*."
      )
    )
  )
Results of linear regression analysis comparing the slopes of lines which track daily change in cumulative confirmed infection rates, scaled for population size in each state. Intercept terms are not shown. All states are shown in comparison to NSW. Any highlighted state indicates an infection rate which is increasing at a different daily rate to NSW. Positive values in the Estimate column indicate a greater rate the NSW, whilst a negative value in the Estimate column indicates a slower growth rate than NSW. Diagnostic plots visually appeared to be within acceptable bounds for real data, the fitted model passed the Shapiro-Wilk Test for normality (p = 0.0657).
Term Estimate SE T p FDR
Victoria 0.0050 0.007 0.72 0.4738 0.5685
Queensland 0.0153 0.007 2.22 0.0310 0.0465
Western Australia 0.0175 0.007 2.54 0.0141 0.0282
South Australia 0.0019 0.006 0.35 0.7301 0.7301
Tasmania -0.0166 0.006 -3.02 0.0040 0.0120
coef(lm) %>% 
  enframe(name = "Term", value = "coef") %>%
  dplyr::filter(str_detect(Term, "date")) %>%
  mutate(
    slope = case_when(
      str_detect(Term, "State") ~ coef[[1]] + coef,
      !str_detect(Term, "State") ~ coef
    ),
    Term = str_remove(Term, "State"),
    State = str_remove(Term, ":date"),
    State = str_replace(State, "date", "New South Wales")
    ) %>%
  left_join(
    confirmed %>% 
      dplyr::filter(
        Country == "Australia", 
        date == max(date)
      ) %>% 
      rename(State = `Province/State`) %>%
      dplyr::select(State, date, confirmed)
  ) %>%
  mutate(
    predicted = round(10^slope*confirmed, 0),
    slope = percent(10^slope, accuracy = 0.1),
  ) %>%
  dplyr::select(
    State, 
    `Expected Daily Increase` = slope,
    `Most Recent` = confirmed,
    `Predicted For Next Day` = predicted
    ) %>%
  pander(
    justify = "lrrr",
    caption = paste(
      "Results from linear regression shown as the",
      "expected daily increase in cases for the",
      "current exponential growth phase.",
      "Most recent values were taken from the official",
      "figures on",
      confirmed %>%
        dplyr::filter(Country == "Australia") %>%
        .[["date"]] %>%
        max() %>%
        format("%d %B, %Y."),
      "Predictions are for",
      confirmed %>%
        dplyr::filter(Country == "Australia") %>%
        .[["date"]] %>%
        max() %>%
        add(1) %>%
        format("%d %B, %Y."),
      "Standard errors for predictions have not been included."
    )
  )
Results from linear regression shown as the expected daily increase in cases for the current exponential growth phase. Most recent values were taken from the official figures on 21 March, 2020. Predictions are for 22 March, 2020. Standard errors for predictions have not been included.
State Expected Daily Increase Most Recent Predicted For Next Day
New South Wales 121.1% 436 528
Victoria 122.4% 229 280
Queensland 125.4% 221 277
Western Australia 126.0% 90 113
South Australia 121.6% 67 81
Tasmania 116.5% 16 19

Assessment of Testing Within Each State

Testing numbers were originally sourced from https://covid-19-au.github.io/, then were automatically updated should any changes indicated at https://api.infotorch.org/api/covid19/totals/. The number of tested individuals in each state was then assessed as a function of State population size. All results are valid at the time of report generation, however these numbers are initially dependent on media releases and may lag actual testing numbers by 24-48 hours.

tribble(
  ~State, ~Tested, 
  "New South Wales", 40651, 
  "Queensland", 29579,
  "Victoria", 19337,
  "Western Australia", 8603,
  "South Australia", 13000,
  "Tasmania", 807,
  "Australian Capital Territory", 2062
) %>%
  left_join(
    latestAU, 
    by = c("State"  = "Province/State")
    ) %>% 
  rowwise() %>% 
  mutate(
    Tested = max(Tested, tested, na.rm = TRUE)
    ) %>% 
  ungroup() %>% 
  dplyr::select(State, Tested) %>%
  left_join(
    confirmed %>%
      dplyr::filter(
        date == max(date)
      ) %>%
      rename(
        State = `Province/State`
      )
  ) %>%
  left_join(ausPops) %>%
  mutate(
    `Percent Tested` = Tested / Population,
    Positive = confirmed / Tested,
    Negative = 1 - Positive
  ) %>%
  dplyr::select(
    State, Population, `Total Tested` = Tested, contains("Percent"), ends_with("ive")
  ) %>%
  arrange(desc(`Percent Tested`)) %>%
  mutate_at(
    vars(contains("Percent"), ends_with("ive")),
    percent,
    accuracy = 0.01
  ) %>%
  pander(
    justify = "lrrrrr",
    caption = paste(
      "*COVID-19 testing scaled by State population size.",
      "Confirmed cases are assumed to be the tests",
      "returning a positive result.",
      "The current numbers available for SA are a", 
      "lower limit, so in reality, the proportion of",
      "the population tested is higher, as is the",
      "proportion of tests returning a negative result.*"
    )
  )
COVID-19 testing scaled by State population size. Confirmed cases are assumed to be the tests returning a positive result. The current numbers available for SA are a lower limit, so in reality, the proportion of the population tested is higher, as is the proportion of tests returning a negative result.
State Population Total Tested Percent Tested Positive Negative
South Australia 1,756,494 13,000 0.74% 0.52% 99.48%
Queensland 5,115,451 29,867 0.58% 0.74% 99.26%
New South Wales 8,117,976 46,892 0.58% 0.93% 99.07%
Australian Capital Territory 428,060 2,062 0.48% 0.29% 99.71%
Western Australia 2,630,557 8,603 0.33% 1.05% 98.95%
Victoria 6,629,870 19,337 0.29% 1.18% 98.82%
Tasmania 535,500 807 0.15% 1.98% 98.02%

R Session Information

R version 3.6.3 (2020-02-29)

Platform: x86_64-pc-linux-gnu (64-bit)

locale: LC_CTYPE=en_AU.UTF-8, LC_NUMERIC=C, LC_TIME=en_AU.UTF-8, LC_COLLATE=en_AU.UTF-8, LC_MONETARY=en_AU.UTF-8, LC_MESSAGES=en_AU.UTF-8, LC_PAPER=en_AU.UTF-8, LC_NAME=C, LC_ADDRESS=C, LC_TELEPHONE=C, LC_MEASUREMENT=en_AU.UTF-8 and LC_IDENTIFICATION=C

attached base packages: stats, graphics, grDevices, utils, datasets, methods and base

other attached packages: plotly(v.4.9.2), pander(v.0.6.3), jsonlite(v.1.6.1), glue(v.1.3.1), broom(v.0.5.4), ggrepel(v.0.8.1), matrixStats(v.0.55.0), scales(v.1.1.0), lubridate(v.1.7.4), magrittr(v.1.5), forcats(v.0.4.0), stringr(v.1.4.0), dplyr(v.0.8.4), purrr(v.0.3.3), readr(v.1.3.1), tidyr(v.1.0.2), tibble(v.2.1.3), ggplot2(v.3.2.1) and tidyverse(v.1.3.0)

loaded via a namespace (and not attached): httr(v.1.4.1), viridisLite(v.0.3.0), modelr(v.0.1.6), shiny(v.1.4.0), assertthat(v.0.2.1), highr(v.0.8), cellranger(v.1.1.0), yaml(v.2.2.1), pillar(v.1.4.3), backports(v.1.1.5), lattice(v.0.20-40), digest(v.0.6.25), promises(v.1.1.0), rvest(v.0.3.5), colorspace(v.1.4-1), htmltools(v.0.4.0), httpuv(v.1.5.2), pkgconfig(v.2.0.3), haven(v.2.2.0), xtable(v.1.8-4), later(v.1.0.0), generics(v.0.0.2), farver(v.2.0.3), ellipsis(v.0.3.0), withr(v.2.1.2), lazyeval(v.0.2.2), cli(v.2.0.1), crayon(v.1.3.4), readxl(v.1.3.1), mime(v.0.9), evaluate(v.0.14), fs(v.1.3.1), fansi(v.0.4.1), nlme(v.3.1-144), MASS(v.7.3-51.5), xml2(v.1.2.2), Cairo(v.1.5-10), tools(v.3.6.3), data.table(v.1.12.8), hms(v.0.5.3), lifecycle(v.0.1.0), munsell(v.0.5.0), reprex(v.0.3.0), compiler(v.3.6.3), rlang(v.0.4.4), grid(v.3.6.3), rstudioapi(v.0.11), htmlwidgets(v.1.5.1), crosstalk(v.1.0.0), labeling(v.0.3), rmarkdown(v.2.1), gtable(v.0.3.0), DBI(v.1.1.0), curl(v.4.3), R6(v.2.4.1), zoo(v.1.8-7), knitr(v.1.28), fastmap(v.1.0.1), stringi(v.1.4.6), Rcpp(v.1.0.3), vctrs(v.0.2.3), dbplyr(v.1.4.2), tidyselect(v.1.0.0) and xfun(v.0.12)